StreamInterface: prevent socket/reader-thread leak on handshake failure#918

Open
nightjoker7 wants to merge 1 commit into meshtastic:master from nightjoker7:fix-stream-handshake-leak

Conversation

@nightjoker7

Problem

StreamInterface.__init__ calls self.connect() which spawns the reader thread and opens the stream, then calls self.waitForConfig() which blocks for the config handshake. If either raises — handshake timeout on a busy/loaded node is the common case — the partially-constructed object propagates the exception up, but:

  • the reader thread spawned in connect() keeps running
  • the underlying stream (socket on TCPInterface, serial.Serial on SerialInterface, etc.) stays open
  • the caller never receives a reference to self, so they cannot call close() themselves

In a caller that retries on failure (passive loggers, reconnect loops, monitoring daemons), the thread + stream leak compounds with every retry. On the firmware side, every re-attempted handshake triggers a fresh config-dump + NodeDB-dump on the node, so the client-side leak directly inflates heap pressure on the radio.

Fix

Two small changes in meshtastic/stream_interface.py:

  1. __init__: wrap connect() + waitForConfig() in try/except that calls self.close() on any exception before re-raising. This ensures the reader thread and the stream are released even when __init__ cannot return normally.

  2. close(): guard self._rxThread.join() against the RuntimeError raised when close() runs from a failed __init__ before connect() has spawned the reader thread.
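
The two changes above can be sketched with a minimal stand-in class (not the real StreamInterface — only the names connect(), waitForConfig()-style failure, close() and _rxThread mirror it; everything else is simplified for illustration):

```python
import contextlib
import threading
import time

class StreamLike:
    """Stand-in illustrating the fix: clean up in __init__ on failure,
    and tolerate joining a never-started reader thread in close()."""

    def __init__(self, fail_handshake=False):
        self._wantExit = False
        self._rxThread = threading.Thread(target=self._reader, daemon=True)
        self.stream_open = False
        try:
            self.connect()
            if fail_handshake:  # stands in for waitForConfig() timing out
                raise TimeoutError("config handshake timed out")
        except Exception:
            # The caller never gets a reference to self, so release the
            # reader thread and the stream here before re-raising.
            with contextlib.suppress(Exception):
                self.close()
            raise

    def connect(self):
        self.stream_open = True  # stands in for opening the socket/serial port
        self._rxThread.start()

    def _reader(self):
        while not self._wantExit:
            time.sleep(0.01)

    def close(self):
        self._wantExit = True
        if self._rxThread is not threading.current_thread():
            try:
                self._rxThread.join(timeout=2.0)
            except RuntimeError:
                pass  # thread was never started: connect() failed early
        self.stream_open = False
```

On a failed handshake the constructor still raises, but the reader thread is joined and the stream released before the exception leaves __init__.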

How it was found

Hit this leak on a fleet-monitoring daemon retrying TCPInterface(host) every 15 seconds against a Station-G2 node with NodeDB at MAX_NUM_NODES=250. Each handshake timed out, each retry leaked a reader thread + socket, and the compound reconnect pressure on the node correlated with cascading reboots. After applying this fix client-side (together with firmware null-checks in meshtastic/firmware#10226 and #10229), the same deployment has been stable.

Minimal repro: run TCPInterface('<host that accepts TCP but does not respond to meshtastic protocol>') in a retry loop and watch threading.active_count() climb.
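
The climb can also be demonstrated without a radio or the meshtastic package. The sketch below (hypothetical names, plain sockets) stands up a silent TCP peer and a client that, like the pre-fix __init__, abandons its reader thread and socket on handshake timeout:

```python
import socket
import socketserver
import threading

class SilentHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Accept the TCP connection but never speak the protocol,
        # like a node too loaded to answer the config request.
        try:
            self.request.recv(1)
        except OSError:
            pass

def leaky_connect(port):
    """Mimics the pre-fix failure path: open a stream, start a reader
    thread, then raise without cleaning either up."""
    sock = socket.create_connection(("127.0.0.1", port))
    reader = threading.Thread(target=lambda: sock.recv(1), daemon=True)
    reader.start()
    raise TimeoutError("handshake timed out")

server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), SilentHandler)
server.daemon_threads = True
threading.Thread(target=server.serve_forever, daemon=True).start()

before = threading.active_count()
for _ in range(5):
    try:
        leaky_connect(server.server_address[1])
    except TimeoutError:
        pass
leaked = threading.active_count() - before
print(f"{leaked} extra threads after 5 retries")
```

Each retry strands at least one blocked reader thread, so the count only ever grows — the same shape as the real leak.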

Backwards compatibility

Happy-path behavior is unchanged. On the failure path, callers that previously received an exception from __init__ and did nothing will now see the same exception but without the orphaned thread + stream. Callers that relied on the thread staying alive after a failed __init__ (unlikely — they had no reference to it) would notice, but that would be an unusual pattern.

Related

…re in __init__

If connect() or waitForConfig() raises during __init__ (handshake timeout,
bad stream, config error), the reader thread started by connect() keeps
running and the underlying stream/socket stays open — but the caller never
receives a reference to the half-initialized instance, so they cannot call
close() themselves. The leak compounds on every retry from a caller's
reconnect loop.

Fix: wrap connect() + waitForConfig() in try/except; call self.close() on
any exception before re-raising. Also guard close() against RuntimeError
from joining an unstarted reader thread (happens when close() runs from
a failed __init__ before connect() could spawn it).

Discovered while debugging a real-world Meshtastic firmware crash where
a passive logger's retrying TCPInterface() calls against a node with
250-entry NodeDB produced a reconnect storm — every retry triggered a
full config+NodeDB dump on the node, compounding heap pressure, which
then exposed null-deref bugs in Router::perhapsDecode / MeshService
(firmware side fixed in meshtastic/firmware#10226 and #10229). The
client-side leak is independent of those firmware bugs and worth fixing
on its own.
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a resource-leak failure mode in StreamInterface where exceptions during initial protocol handshake (connect()/waitForConfig()) could leave a running reader thread and an open underlying transport (serial/TCP), even though the constructor raised before returning an instance to the caller.

Changes:

  • Wrap connect() + waitForConfig() in StreamInterface.__init__ with exception handling that attempts to close() before re-raising.
  • Make StreamInterface.close() tolerate RuntimeError from joining an unstarted reader thread.

Comment on lines +156 to +157
# Nothing to join; safe to ignore.
pass

Copilot AI Apr 23, 2026


If connect() fails before the reader thread is started (e.g., serial write raises before _rxThread.start()), this RuntimeError path will skip joining but also won’t close the underlying stream. For SerialInterface (which opens the port before calling StreamInterface.__init__), this means the serial port can remain open even though close() was invoked. Consider explicitly closing self.stream (and setting it to None) when the thread was never started / is not alive, instead of just pass.

Suggested change
-            # Nothing to join; safe to ignore.
-            pass
+            # In that case the reader thread cannot run _disconnected(),
+            # so close the stream here to avoid leaking the resource.
+            if self.stream is not None:
+                with contextlib.suppress(Exception):
+                    self.stream.close()
+                self.stream = None

Comment on lines 150 to +153
         if self._rxThread != threading.current_thread():
-            self._rxThread.join()  # wait for it to exit
+            try:
+                self._rxThread.join()  # wait for it to exit
+            except RuntimeError:
Copy link

Copilot AI Apr 23, 2026


join() without a timeout can block indefinitely if the reader thread is stuck in a blocking _readBytes() (e.g., TCPInterface uses blocking socket.recv() with no timeout). With the new __init__ exception handler calling self.close(), a handshake timeout against a silent TCP peer can hang here and never re-raise the original exception. Consider using a bounded join (and letting transport-specific close() logic interrupt the read), or otherwise ensuring blocking reads are interrupted before joining.
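
One way to make that bound concrete, sketched with a plain socket pair standing in for the transport (in the real code the shutdown/close would live in the transport-specific close() logic):

```python
import socket
import threading

a, b = socket.socketpair()

def reader():
    # Stands in for the reader loop blocked in _readBytes() against a
    # silent peer: recv() never returns on its own.
    try:
        while a.recv(1):
            pass
    except OSError:
        pass  # transport shut down underneath us

t = threading.Thread(target=reader, daemon=True)
t.start()

# An unbounded t.join() here could hang forever. Instead, interrupt the
# blocking recv() by shutting the transport down, then join with a bound.
a.shutdown(socket.SHUT_RDWR)
t.join(timeout=5.0)
reader_exited = not t.is_alive()
print("reader exited:", reader_exited)
a.close()
b.close()
```

Shutting down the socket wakes the blocked recv() (it returns EOF), so the bounded join returns promptly and close() can never wedge the caller.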

Comment on lines +65 to +78
        try:
            self.connect()
            if not noProto:
                self.waitForConfig()
        except Exception:
            # Handshake failed (timeout, config error, bad stream). The caller
            # never receives a reference to this half-initialized instance, so
            # they cannot call close() themselves. If we don't clean up here,
            # the reader thread (already started by connect()) keeps running
            # and the underlying stream/socket leaks — the leak compounds on
            # every retry from the caller's reconnect loop.
            with contextlib.suppress(Exception):
                self.close()
            raise

Copilot AI Apr 23, 2026


New failure-path behavior (calling close() from __init__ when connect() / waitForConfig() raises) is important for leak prevention but currently isn’t covered by unit tests. Since there are existing tests for stream_interface.py, consider adding a test that forces connect() or waitForConfig() to raise and asserts cleanup occurs (e.g., close() is invoked and doesn’t raise when the thread wasn’t started).
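
Such a test might look like this (a sketch against a stand-in class; a real test would import StreamInterface and stub connect()/waitForConfig() the same way):

```python
from unittest import mock

class FakeIface:
    """Mirrors only the patched __init__ control flow of StreamInterface."""

    def __init__(self):
        try:
            self.connect()
            self.waitForConfig()
        except Exception:
            try:
                self.close()
            except Exception:
                pass
            raise

    def connect(self):
        pass

    def waitForConfig(self):
        raise TimeoutError("no config within timeout")

    def close(self):
        pass

def test_close_called_on_handshake_failure():
    with mock.patch.object(FakeIface, "close") as closed:
        try:
            FakeIface()
            raise AssertionError("expected TimeoutError")
        except TimeoutError:
            pass
    closed.assert_called_once()
    return True

passed = test_close_called_on_handshake_failure()
print("passed:", passed)
```

The test asserts two things at once: the original exception still propagates, and close() ran exactly once on the failure path.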

Copilot generated this review using guidance from repository custom instructions.

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.52%. Comparing base (cec79a7) to head (9011baa).

Files with missing lines Patch % Lines
meshtastic/stream_interface.py 46.15% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #918      +/-   ##
==========================================
- Coverage   61.57%   61.52%   -0.06%     
==========================================
  Files          25       25              
  Lines        4448     4457       +9     
==========================================
+ Hits         2739     2742       +3     
- Misses       1709     1715       +6     
Flag Coverage Δ
unittests 61.52% <46.15%> (-0.06%) ⬇️

